{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "LSA.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LOWANK49Pi27",
        "colab_type": "text"
      },
      "source": [
        "# 潜在语义分析（Latent semantic analysis, LSA）"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0WD7XVWRPkX1",
        "colab_type": "text"
      },
      "source": [
        "**LSA** 是一种无监督学习方法，主要用于文本的话题分析，其特点是通过矩阵分解发现文本与单词之间的基于话题的语义关系。也称为潜在语义索引（Latent semantic indexing, LSI）。\n",
        "\n",
        "LSA 使用的是非概率的话题分析模型。将文本集合表示为**单词-文本矩阵**，对单词-文本矩阵进行**奇异值分解**，从而得到话题向量空间，以及文本在话题向量空间的表示。\n",
        "\n",
        "**非负矩阵分解**（non-negative matrix factorization, NMF）是另一种矩阵的因子分解方法，其特点是分解的矩阵非负。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "P1sWKgTGQ7r-",
        "colab_type": "text"
      },
      "source": [
        "## 单词向量空间  \n",
        "word vector space model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CqXj1777RM8y",
        "colab_type": "text"
      },
      "source": [
        "给定一个文本，用一个向量表示该文本的”语义“， 向量的**每一维对应一个单词**，其数值为该单词在该文本中出现的频数或权值；基本假设是文本中所有单词的出现情况表示了文本的语义内容，文本集合中的每个文本都表示为一个向量，存在于一个向量空间；向量空间的度量，如内积或标准化**内积**表示文本之间的**相似度**。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3HVCXf6CSmTT",
        "colab_type": "text"
      },
      "source": [
        "给定一个含有$n$个文本的集合$D=({d_{1}, d_{2},...,d_{n}})$，以及在所有文本中出现的$m$个单词的集合$W=({w_{1},w_{2},...,w_{m}})$. 将单词在文本的出现的数据用一个单词-文本矩阵（word-document matrix）表示，记作$X$:\n",
        "\n",
        "$\n",
        "X = \\begin{bmatrix}\n",
        "x_{11} &  x_{12}&  x_{1n}& \\\\ \n",
        "x_{21}&  x_{22}&  x_{2n}& \\\\ \n",
        "\\vdots &  \\vdots &  \\vdots & \\\\ \n",
        "x_{m1}&  x_{m2}&  x_{mn}& \n",
        "\\end{bmatrix}\n",
        "$\n",
        "\n",
        "这是一个$m*n$矩阵，元素$x_{ij}$表示单词$w_{i}$在文本$d_{j}$中出现的频数或权值。由于单词的种类很多，而每个文本中出现单词的种类通常较少，所有单词-文本矩阵是一个稀疏矩阵。\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "K2ncB3cde1Ab",
        "colab_type": "text"
      },
      "source": [
        "权值通常用单词**频率-逆文本率**（term frequency-inverse document frequency, TF-IDF）表示：\n",
        "\n",
        "$TF-IDF(t, d ) = TF(t, d) * IDF(t)$,  \n",
        "\n",
        "其中，$TF(t,d)$为单词$t$在文本$d$中出现的概率，$IDF(t)$是逆文本率，用来衡量单词$t$对表示语义所起的重要性， \n",
        "\n",
        "$IDF(t) = log(\\frac{len(D)}{len(t \\in D) + 1})$."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bpu7MycIgu65",
        "colab_type": "text"
      },
      "source": [
        "单词向量空间模型的优点是**模型简单，计算效率高**。因为单词向量通常是稀疏的，单词向量空间模型也有一定的局限性，体现在内积相似度未必能够准确表达两个文本的语义相似度上。因为自然语言的单词具有一词多义性（polysemy）及多词一义性（synonymy）。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ns5wncZohn-z",
        "colab_type": "text"
      },
      "source": [
        "## 话题向量空间"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mmZpPHIdhrAy",
        "colab_type": "text"
      },
      "source": [
        "**1. 话题向量空间**：\n",
        "\n",
        "给定一个含有$n$个文本的集合$D=({d_{1}, d_{2},...,d_{n}})$，以及在所有文本中出现的$m$个单词的集合$W=({w_{1},w_{2},...,w_{m}})$. 可以获得其单词-文本矩阵$X$:  \n",
        "\n",
        "$\n",
        "X = \\begin{bmatrix}\n",
        "x_{11} &  x_{12}&  x_{1n}& \\\\ \n",
        "x_{21}&  x_{22}&  x_{2n}& \\\\ \n",
        "\\vdots &  \\vdots &  \\vdots & \\\\ \n",
        "x_{m1}&  x_{m2}&  x_{mn}& \n",
        "\\end{bmatrix}\n",
        "$\n",
        "\n",
        "\n",
        "假设所有文本共含有$k$个话题。假设每个话题由一个定义在单词集合$W$上的$m$维向量表示，称为话题向量，即：  \n",
        "$t_{l} = \\begin{bmatrix}\n",
        "t_{1l}\\\\ \n",
        "t_{2l}\\\\ \n",
        "\\vdots \\\\ \n",
        "t_{ml}\\end{bmatrix},  l=1,2,...,k$\n",
        "\n",
        "其中$t_{il}$单词$w_{i}$在话题$t_{l}$的权值，$i=1,2,...,m$, 权值越大，该单词在该话题中的重要程度就越高。这$k$个话题向量 $t_{1},t_{2},...,t_{k}$张成一个话题向量空间（topic vector space）， 维数为$k$.**话题向量空间是单词向量空间的一个子空间**。\n",
        "\n",
        "话题向量空间$T$：  \n",
        "\n",
        "\n",
        "$\n",
        "T = \\begin{bmatrix}\n",
        "t_{11} &  t_{12}&  t_{1k}& \\\\ \n",
        "t_{21}&  t_{22}&  t_{2k}& \\\\ \n",
        "\\vdots &  \\vdots &  \\vdots & \\\\ \n",
        "t_{m1}&  t_{m2}&  t_{mk}& \n",
        "\\end{bmatrix}\n",
        "$  \n",
        "\n",
        "矩阵$T$,称为**单词-话题矩阵**。 $T = [t_{1},  t_{2}, ..., t_{k}]$"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Oc1c3JcKlTBD",
        "colab_type": "text"
      },
      "source": [
        "**2. 文本在话题向量空间中的表示**  ：\n",
        "\n",
        "考虑文本集合$D$的文本$d_{j}$, 在单词向量空间中由一个向量$x_{j}$表示，将$x_{j}$投影到话题向量空间$T$中，得到话题向量空间的一个向量$y_{j}$, $y_{j}$是一个$k$维向量：  \n",
        "\n",
        "$y_{j} = \\begin{bmatrix}\n",
        "y_{1j}\\\\ \n",
        "y_{2j}\\\\ \n",
        "\\vdots \\\\ \n",
        "y_{kj}\\end{bmatrix},  j=1,2,...,n$ \n",
        "\n",
        "其中，$y_{lj}$是文本$d_{j}$在话题$t_{l}$中的权值， $l = 1,2,..., k$, 权值越大，该话题在该文本中的重要程度就越高。  \n",
        "\n",
        "矩阵$Y$ 表示话题在文本中出现的情况，称为话题-文本矩阵（topic-document matrix）,记作：  \n",
        "\n",
        "$\n",
        "Y = \\begin{bmatrix}\n",
        "y_{11} &  y_{12}&  y_{1n}& \\\\ \n",
        "y_{21}&  y_{22}&  y_{2n}& \\\\ \n",
        "\\vdots &  \\vdots &  \\vdots & \\\\ \n",
        "y_{k1}&  y_{k2}&  y_{kn}& \n",
        "\\end{bmatrix}\n",
        "$  \n",
        "\n",
        "也可写成： $Y = [y_{1}, y_{2} ..., y_{n}]$"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YcU3xwYindTo",
        "colab_type": "text"
      },
      "source": [
        "**3. 从单词向量空间到话题向量空间的线性变换**：  \n",
        "\n",
        "如此，单词向量空间的文本向量$x_{j}$可以通过他在话题空间中的向量$y_{j}$近似表示，具体地由$k$个话题向量以$y_{j}$为系数的线性组合近似表示：  \n",
        "\n",
        "$x_{j} = y_{1j}t_{1} + y_{2j}t_{2} + ... + y_{yj}t_{k}, j = 1,2,..., n$  \n",
        "\n",
        "所以，单词-文本矩阵$X$可以近似的表示为单词-话题矩阵$T$与话题-文本矩阵$Y$的乘积形式。\n",
        "\n",
        "$X \\approx TY$  \n",
        "\n",
        "直观上，潜在语义分析是将单词向量空间的表示通过线性变换转换为在话题向量空间中的表示。这个线性变换由矩阵因子分解式的形式体现。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Cu4JekfXFMqs",
        "colab_type": "text"
      },
      "source": [
        "### 潜在语义分析算法  "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0awNXCy1Gw0K",
        "colab_type": "text"
      },
      "source": [
        "潜在语义分析利用矩阵奇异值分解，具体地，对单词-文本矩阵进行奇异值分解，将其左矩阵作为话题向量空间，将其对角矩阵与右矩阵的乘积作为文本在话题向量空间的表示。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "otq3HMu5HVoK",
        "colab_type": "text"
      },
      "source": [
        "给定一个含有$n$个文本的集合$D=({d_{1}, d_{2},...,d_{n}})$，以及在所有文本中出现的$m$个单词的集合$W=({w_{1},w_{2},...,w_{m}})$. 可以获得其单词-文本矩阵$X$:  \n",
        "$\n",
        "X = \\begin{bmatrix}\n",
        "x_{11} &  x_{12}&  x_{1n}& \\\\ \n",
        "x_{21}&  x_{22}&  x_{2n}& \\\\ \n",
        "\\vdots &  \\vdots &  \\vdots & \\\\ \n",
        "x_{m1}&  x_{m2}&  x_{mn}& \n",
        "\\end{bmatrix}\n",
        "$\n",
        "\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mwNGRDgrHmmV",
        "colab_type": "text"
      },
      "source": [
        "**截断奇异值分解**：\n",
        "\n",
        "潜在语义分析根据确定的话题数$k$对单词-文本矩阵$X$进行截断奇异值分解：  \n",
        "\n",
        "$\n",
        "X \\approx U_{k}\\Sigma _{k}V_{k}^{T} = \\begin{bmatrix}\n",
        "\\mu _{1} &  \\mu _{2}&  \\cdots & \\mu _{k}\n",
        "\\end{bmatrix}\\begin{bmatrix}\n",
        "\\sigma_{1} &  0&  0& 0\\\\ \n",
        " 0&  \\sigma_{2}&  0& 0\\\\ \n",
        " 0&  0&  \\ddots & 0\\\\ \n",
        " 0&  0&  0& \\sigma_{k}\n",
        "\\end{bmatrix}\\begin{bmatrix}\n",
        "v_{1}^{T}\\\\ \n",
        "v_{2}^{T}\\\\ \n",
        "\\vdots \\\\ \n",
        "v_{k}^{T}\\end{bmatrix}\n",
        "$\n",
        "\n",
        "矩阵$U_{k}$的每一个列向量 $u_{1}, u_{2},..., u_{k}$ 表示一个话题，称为**话题向量**。由这 $k$ 个话题向量张成一个子空间：  \n",
        "\n",
        "$\n",
        "U_{k} = \\begin{bmatrix}\n",
        "u_{1} &  u_{2}&  \\cdots & u_{k}\n",
        "\\end{bmatrix}\n",
        "$\n",
        "\n",
        "称为**话题向量空间**。  \n",
        "\n",
        "综上， 可以通过对单词-文本矩阵的奇异值分解进行潜在语义分析：  \n",
        "\n",
        "$ X \\approx U_{k} \\Sigma_{k} V_{k}^{T} = U_{k}(\\Sigma_{k}V_{k}^{T})$  \n",
        "\n",
        "得到话题空间 $U_{k}$ ， 以及文本在话题空间的表示($\\Sigma_{k}V_{k}^{T}$). "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UTNHyq8mK8l5",
        "colab_type": "text"
      },
      "source": [
        "### 非负矩阵分解算法"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RQvqaMDYK_jf",
        "colab_type": "text"
      },
      "source": [
        "非负矩阵分解也可以用于话题分析。对单词-文本矩阵进行非负矩阵分解，将**其左矩阵作为话题向量空间**，将其**右矩阵作为文本在话题向量空间的表示**。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ApM8tE3MLqpP",
        "colab_type": "text"
      },
      "source": [
        "#### 非负矩阵分解  "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "glMwmkiwLyIn",
        "colab_type": "text"
      },
      "source": [
        "若一个矩阵的索引元素非负，则该矩阵为非负矩阵。若$X$是非负矩阵，则： $X >= 0$.  \n",
        "\n",
        "给定一个非负矩阵$X$, 找到两个非负矩阵$W >= 0$ 和 $H>= 0$, 使得：  \n",
        "\n",
        "$ X \\approx WH$\n",
        "\n",
        "即非负矩阵$X$分解为两个非负矩阵$W$和$H$的乘积形式，成为非负矩阵分解。因为$WH$与$X$完全相等很难实现，所以只要求近似相等。  \n",
        "\n",
        "假设非负矩阵$X$是$m\\times n$矩阵，非负矩阵$W$和$H$分别为 $m\\times k$ 矩阵和 $k\\times n$ 矩阵。假设 $k < min(m, n)$ 即$W$ 和 $H$ 小于原矩阵 $X$, 所以非负矩阵分解是对原数据的压缩。\n",
        "\n",
        "称 $W$ 为基矩阵， $H$ 为系数矩阵。非负矩阵分解旨在用较少的基向量，系数向量来表示为较大的数据矩阵。\n",
        "\n",
        "令 $W = \\begin{bmatrix}\n",
        "w_{1} &  w_{2}&  \\cdots& w_{k} \n",
        "\\end{bmatrix}$\n",
        "为话题向量空间， $w_{1}, w_{2}, ..., w_{k}$ 表示文本集合的 $k$ 个话题， 令 $H = \\begin{bmatrix}\n",
        "h_{1} &  h_{2}&  \\cdots& h_{n} \n",
        "\\end{bmatrix}$\n",
        "为文本在话题向量空间的表示， $h_{1}, h_{2},..., h_{n}$ 表示文本集合的 $n$ 个文本。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1DcvVSR0N_CF",
        "colab_type": "text"
      },
      "source": [
        "##### 算法"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hvZyHT85O5qt",
        "colab_type": "text"
      },
      "source": [
        "非负矩阵分解可以形式化为最优化问题求解。可以利用平方损失或散度来作为损失函数。\n",
        "\n",
        "目标函数 $|| X - WH ||^{2}$ 关于 $W$ 和 $H$ 的最小化，满足约束条件 $W, H >= 0$, 即：  \n",
        "\n",
        "$\\underset{W,H}{min} || X - WH ||^{2}$  \n",
        "\n",
        "\n",
        "$s.t.  W, H >= 0$"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-zIGS1AEQdWp",
        "colab_type": "text"
      },
      "source": [
        "乘法更新规则： \n",
        "\n",
        "\n",
        "$W_{il} \\leftarrow W_{il}\\frac{(XH^{T})_{il}}{(WHH^{T})_{il}}$  （17.33）\n",
        "\n",
        "\n",
        "$H_{lj} \\leftarrow H_{lj}\\frac{(W^{T}X)_{lj}}{(W^{T}WH)_{lj}}$  （17.34）\n",
        "\n",
        "\n",
        "选择初始矩阵 $W$ 和 $H$ 为非负矩阵，可以保证迭代过程及结果的矩阵 $W$ 和 $H$ 非负。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MeiA0REkRpRi",
        "colab_type": "text"
      },
      "source": [
        "**算法 17.1 （非负矩阵分解的迭代算法）**\n",
        "\n",
        "输入： 单词-文本矩阵 $X >= 0$, 文本集合的话题个数 $k$, 最大迭代次数 $t$；  \n",
        "输出： 话题矩阵 $W$, 文本表示矩阵 $H$。  \n",
        "\n",
        "**1)**. 初始化\n",
        "\n",
        "$W>=0$, 并对 $W$ 的每一列数据归一化；  \n",
        "$H>=0$；\n",
        "\n",
        "**2)**. 迭代  \n",
        "\n",
        "对迭代次数由1到$t$执行下列步骤：  \n",
        "a. 更新$W$的元素，对 $l$ 从1到 $k,i$从1到$m$按(17.33)更新 $W_{il}$;   \n",
        "a. 更新$H$的元素，对 $l$ 从1到 $k,j$从1到$m$按(17.34)更新 $H_{lj}$; "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rIw6a0HITg08",
        "colab_type": "text"
      },
      "source": [
        "### 图例 17.1"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0hPH9VEMPVGu",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import numpy as np\n",
        "from sklearn.decomposition import TruncatedSVD"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kjHirYzQWItl",
        "colab_type": "code",
        "outputId": "0e6f2615-6a0b-4c4f-e74c-559727519eab",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 125
        }
      },
      "source": [
        "X = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 1, 0], [0, 0, 2, 3], [0, 0, 0, 1], [1, 2, 2, 1]]\n",
        "X = np.asarray(X);X"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[2, 0, 0, 0],\n",
              "       [0, 2, 0, 0],\n",
              "       [0, 0, 1, 0],\n",
              "       [0, 0, 2, 3],\n",
              "       [0, 0, 0, 1],\n",
              "       [1, 2, 2, 1]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 2
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "I2yFnNJKWcPP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# 奇异值分解\n",
        "U,sigma,VT=np.linalg.svd(X)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ollDH_QNXAdY",
        "colab_type": "code",
        "outputId": "8d599d82-4bae-4047-941c-8ca524142349",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 233
        }
      },
      "source": [
        "U"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[-7.84368672e-02,  2.84423033e-01,  8.94427191e-01,\n",
              "         2.15138396e-01, -2.68931121e-02, -2.56794523e-01],\n",
              "       [-1.56873734e-01,  5.68846066e-01, -4.47213595e-01,\n",
              "         4.30276793e-01, -5.37862243e-02, -5.13589047e-01],\n",
              "       [-1.42622354e-01, -1.37930417e-02,  4.16333634e-17,\n",
              "        -6.53519444e-01,  4.77828945e-01, -5.69263078e-01],\n",
              "       [-7.28804669e-01, -5.53499910e-01,  3.33066907e-16,\n",
              "         1.56161345e-01, -2.92700697e-01, -2.28957508e-01],\n",
              "       [-1.47853320e-01, -1.75304609e-01,  1.04083409e-16,\n",
              "         4.87733411e-01,  8.24315866e-01,  1.73283476e-01],\n",
              "       [-6.29190197e-01,  5.08166890e-01, -4.44089210e-16,\n",
              "        -2.81459486e-01,  5.37862243e-02,  5.13589047e-01]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 8
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "lmxB_JViXFAF",
        "colab_type": "code",
        "outputId": "9cdcba5c-74f8-42e4-9a4b-fad1c06b83cd",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "sigma"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([4.47696617, 2.7519661 , 2.        , 1.17620428])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 13
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AiXURUScXMsj",
        "colab_type": "code",
        "outputId": "d6fc0576-8bf1-491e-caed-02035079f0d3",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 161
        }
      },
      "source": [
        "VT"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[-1.75579600e-01, -3.51159201e-01, -6.38515454e-01,\n",
              "        -6.61934313e-01],\n",
              "       [ 3.91361272e-01,  7.82722545e-01, -3.79579831e-02,\n",
              "        -4.82432341e-01],\n",
              "       [ 8.94427191e-01, -4.47213595e-01,  0.00000000e+00,\n",
              "         8.32667268e-17],\n",
              "       [ 1.26523351e-01,  2.53046702e-01, -7.68672366e-01,\n",
              "         5.73674125e-01]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 14
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "DKOxld5lXRCK",
        "colab_type": "code",
        "outputId": "0832796a-9952-4f39-e1ef-79347c495d16",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 53
        }
      },
      "source": [
        "# 截断奇异值分解\n",
        "\n",
        "svd = TruncatedSVD(n_components=3, n_iter=7, random_state=42)\n",
        "svd.fit(X)  "
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "TruncatedSVD(algorithm='randomized', n_components=3, n_iter=7, random_state=42,\n",
              "             tol=0.0)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 16
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "btnGrF0LXzZI",
        "colab_type": "code",
        "outputId": "ba85127a-4fa6-4092-828c-9036c47f82f6",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "print(svd.explained_variance_ratio_)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[0.39945801 0.34585056 0.18861789]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "F1hSe5NxX1zw",
        "colab_type": "code",
        "outputId": "b0d0b87d-195b-4653-a857-48ff1eca887a",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "print(svd.explained_variance_ratio_.sum())"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "0.9339264600284481\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cV4L2i9WX30R",
        "colab_type": "code",
        "outputId": "6c313215-d095-41b2-a384-3dc1575729a2",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "print(svd.singular_values_)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[4.47696617 2.7519661  2.        ]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4CbG9kJXictK",
        "colab_type": "text"
      },
      "source": [
        "#### 非负矩阵分解"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "KcA2Rd4Df_DE",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def inverse_transform(W, H):\n",
        "    # 重构\n",
        "    return W.dot(H)\n",
        "\n",
        "def loss(X, X_):\n",
        "    #计算重构误差\n",
        "    return ((X - X_) * (X - X_)).sum()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yRXZt6CfYPJq",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# 算法 17.1\n",
        "\n",
        "class MyNMF:\n",
        "    def fit(self, X, k, t):\n",
        "        m, n = X.shape\n",
        "        \n",
        "        W = np.random.rand(m, k)\n",
        "        W = W/W.sum(axis=0)\n",
        "        \n",
        "        H = np.random.rand(k, n)\n",
        "        \n",
        "        i = 1\n",
        "        while i < t:\n",
        "            \n",
        "            W = W * X.dot(H.T) / W.dot(H).dot(H.T)\n",
        "            \n",
        "            H = H * (W.T).dot(X) / (W.T).dot(W).dot(H)\n",
        "            \n",
        "            i += 1\n",
        "            \n",
        "        return W, H"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zc1IFBBIajXk",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "model = MyNMF()\n",
        "W, H = model.fit(X, 3, 200)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "EYo7JyXXbNpJ",
        "colab_type": "code",
        "outputId": "716dbb03-5229-4c93-a75b-81d07ff4be85",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 125
        }
      },
      "source": [
        "W"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[7.80747563e-282, 1.59350147e-085, 4.46818285e-001],\n",
              "       [1.36585053e-173, 1.82432253e-099, 8.93962574e-001],\n",
              "       [5.15393770e-058, 5.75011993e-001, 4.19426683e-039],\n",
              "       [1.73830597e+000, 1.09961986e+000, 4.87814473e-031],\n",
              "       [5.89525831e-001, 3.53091403e-065, 1.51262003e-035],\n",
              "       [5.54346760e-001, 1.12753836e+000, 1.11381284e+000]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 113
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1cEFDsgXbnXZ",
        "colab_type": "code",
        "outputId": "e3c8eaf0-bcd8-48f5-edee-57f643b07fed",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 71
        }
      },
      "source": [
        "H"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[3.02557029e-05, 2.18916926e-04, 5.10981068e-02, 1.69486742e+00],\n",
              "       [3.08284998e-03, 5.45494813e-03, 1.73785466e+00, 4.89822454e-02],\n",
              "       [8.94680268e-01, 1.79000896e+00, 1.09648099e-02, 4.54347640e-03]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 114
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "JFqGat0JdVlL",
        "colab_type": "code",
        "outputId": "567c82ff-997f-49e7-b829-f9d4c828f9c9",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 125
        }
      },
      "source": [
        "# 重构\n",
        "X_ = inverse_transform(W, H);X_"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[3.99759503e-01, 7.99808736e-01, 4.89927756e-03, 2.03010833e-03],\n",
              "       [7.99810675e-01, 1.60020102e+00, 9.80212969e-03, 4.06169786e-03],\n",
              "       [1.77267571e-03, 3.13666059e-03, 9.99287272e-01, 2.81653785e-02],\n",
              "       [3.44255674e-03, 6.37891391e-03, 1.99980365e+00, 3.00006000e+00],\n",
              "       [1.78365184e-05, 1.29057183e-04, 3.01236539e-02, 9.99168124e-01],\n",
              "       [9.99999171e-01, 2.00000698e+00, 2.00003661e+00, 9.99834205e-01]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 115
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "FmXCjjnyfcfY",
        "colab_type": "code",
        "outputId": "819c8029-12d3-4344-e5af-b352b51507a3",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "# 重构误差\n",
        "\n",
        "loss(X, X_)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "4.001908238790242"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 116
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dGu-tGDxhEcf",
        "colab_type": "text"
      },
      "source": [
        "### 使用 sklearn 计算"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "sLN4FLmvb6tt",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from sklearn.decomposition import NMF\n",
        "model = NMF(n_components=3, init='random', max_iter=200, random_state=0)\n",
        "W = model.fit_transform(X)\n",
        "H = model.components_"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Fm5W6xQ0b_jl",
        "colab_type": "code",
        "outputId": "d2ae79f6-0fea-47ba-a2ba-1742b78f71b1",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 125
        }
      },
      "source": [
        "W"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[0.        , 0.53849498, 0.        ],\n",
              "       [0.        , 1.07698996, 0.        ],\n",
              "       [0.69891361, 0.        , 0.        ],\n",
              "       [1.39782972, 0.        , 1.97173859],\n",
              "       [0.        , 0.        , 0.65783848],\n",
              "       [1.39783002, 1.34623756, 0.65573258]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 108
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wheFi8ZwcCY9",
        "colab_type": "code",
        "outputId": "9dd57216-889a-4cc4-ccb8-82e790595d59",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 71
        }
      },
      "source": [
        "H"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[0.00000000e+00, 0.00000000e+00, 1.43078959e+00, 1.71761682e-03],\n",
              "       [7.42810976e-01, 1.48562195e+00, 0.00000000e+00, 3.30264644e-04],\n",
              "       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.52030365e+00]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 109
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9hfj3bRXgHRb",
        "colab_type": "code",
        "outputId": "3be1a4d8-a160-4200-d4e7-f5c92f2aa1af",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 125
        }
      },
      "source": [
        "X__ = inverse_transform(W, H);X__"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[3.99999983e-01, 7.99999966e-01, 0.00000000e+00, 1.77845853e-04],\n",
              "       [7.99999966e-01, 1.59999993e+00, 0.00000000e+00, 3.55691707e-04],\n",
              "       [0.00000000e+00, 0.00000000e+00, 9.99998311e-01, 1.20046577e-03],\n",
              "       [0.00000000e+00, 0.00000000e+00, 2.00000021e+00, 3.00004230e+00],\n",
              "       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00011424e+00],\n",
              "       [1.00000003e+00, 2.00000007e+00, 2.00000064e+00, 9.99758185e-01]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 110
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-iALOTKfgKzP",
        "colab_type": "code",
        "outputId": "80c73327-dd36-403f-aa20-d4e2d3f796a7",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "loss(X, X__)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "4.0000016725824565"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 111
        }
      ]
    }
  ]
}